Classification of data using decision tree [C50] and regression tree [rpart] methods
Tasks covered:
Introduction to data classification and decision trees
Read csv data files into R
Decision tree classification with C50
Decision [regression] tree classification with rpart [R implementation of CART]
Visualization of decision trees
Data classification and decision trees
Data classification is a machine learning methodology for assigning known class labels to unlabeled data. It is a supervised learning technique that uses a training dataset labeled with known class labels. The classification method develops a classification model [a decision tree in this exercise] from the training data using a class purity algorithm. The resulting model can then be used to assign one of the known classes to new, unlabeled data.
A decision tree is the specific model output of the two data classification techniques covered in this exercise. A decision tree is a graphical representation of a rule set that leads to some conclusion, in this case a classification of an input data item. A principal advantage of decision trees is that they are easy to explain and use. The rules of a decision tree follow a basic format. The tree starts at a root node [usually placed at the top of the tree]. Each node of the tree represents a rule whose result splits the options into several branches. As the tree is traversed downward, a leaf node is eventually reached. This leaf node determines the class assigned to the data. The same process could also be accomplished with a simple rule set [and most decision tree methods can output a rule set] but, as stated above, the graphical tree representation tends to be easier to explain to a decision maker.
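To make the traversal idea concrete, here is a small illustrative sketch of a two-level decision tree written as nested rules in R. The attribute names, thresholds, and classes are purely hypothetical and are not tied to any of the datasets used later.
# hypothetical sketch: a two-level decision tree written as nested rules.
# Starting at the root test, each result follows one branch until a leaf
# is reached; the leaf determines the class assigned to the item.
classify_item <- function(height, weight) {
  if (height <= 1.5) {        # root node rule
    'class A'                 # leaf node
  } else if (weight <= 70) {  # second-level node rule
    'class B'                 # leaf node
  } else {
    'class C'                 # leaf node
  }
}
classify_item(1.2, 80)        # follows the left branch of the root -> 'class A'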
This exercise will introduce and
experiment with two decision tree classification methods: C5.0 and
rpart.
C50
C50 is an R implementation of the supervised machine learning algorithm C5.0, which generates a decision tree. The original algorithm was developed by Ross Quinlan. It is an improved version of C4.5, which in turn is based on ID3. The algorithm uses an information entropy computation to determine the best rule that splits the data, at each node, into purer classes by minimizing the computed entropy value. This means that as each node splits the data, based on the rule at that node, each subset produced by the split contains less diversity of classes and will, eventually, contain only one class [complete purity]. This computation is simple, so C50 runs quickly. C50 is also robust: it works with both numeric and categorical data [this example shows both types] and can tolerate missing data values. The output from the R implementation can be either a decision tree or a rule set. The output model can be used to assign [predict] a class to new unclassified data items.
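To make the purity computation concrete, here is a small standalone sketch [illustrative only, not code from the C50 package] of the information entropy of a set of class labels:
# illustrative sketch: information entropy of a vector of class labels y
entropy <- function(y) {
  p <- table(y) / length(y)      # proportion of each observed class
  -sum(p * log2(p))              # Shannon entropy in bits
}
entropy(c('a', 'a', 'b', 'b'))   # 1 bit: maximum diversity for two classes
entropy(c('a', 'a', 'a', 'a'))   # 0 bits: a completely pure subset
A split is attractive when the subsets it produces have lower [weighted] entropy than the node being split.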
The R function rpart is an implementation of the CART [Classification and Regression Tree] supervised machine learning algorithm used to generate a decision tree. CART was developed by Leo Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Because CART is a trademarked name of a particular software implementation, the R implementation is called rpart, for Recursive PARTitioning. Like C50, rpart uses a computational metric to determine the best rule that splits the data, at each node, into purer classes. In the rpart algorithm the computational metric is the Gini coefficient. At each node, rpart minimizes the Gini coefficient and thus splits the data into purer class subsets, with the class leaf nodes at the bottom of the decision tree. The process is simple to compute and runs fairly well, but our example will highlight some computational issues. The output from the R implementation is a decision tree that can be used to assign [predict] a class to new unclassified data items.
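Here is an equally small illustrative sketch [not code from the rpart package] of the Gini impurity computation for a set of class labels:
# illustrative sketch: Gini impurity of a vector of class labels y
gini <- function(y) {
  p <- table(y) / length(y)      # proportion of each observed class
  1 - sum(p^2)                   # 0 = completely pure, larger = more mixed
}
gini(c('a', 'a', 'b', 'b'))      # 0.5: maximum impurity for two classes
gini(c('a', 'a', 'a', 'a'))      # 0: a pure subset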
The datasets
This example uses three datasets [in csv format]: Iris, Wine, and Titanic. All three of these datasets are good examples to use with classification algorithms. Iris and Wine consist of numeric data; Titanic is entirely categorical data. All three datasets have one attribute that can be designated as the class variable [Iris -> Classification (3 values), Wine -> Class (3 values), Titanic -> Survived (2 values)]. The Wine dataset introduces some potential computational complexity because it has 14 variables. This complexity is a good test of the performance of the two methods used in this exercise. The descriptive statistics and charting of these datasets are left as an additional exercise for interested readers.
Description of the Titanic dataset: this file is a modified subset of the Kaggle Titanic dataset. This version contains four categorical attributes: Class, Age, Sex, and Survived. Like the two datasets above, it was downloaded from the UCI Machine Learning Repository; that dataset is no longer available in the repository because the richer Kaggle version exists.
The classification exercise - C50
This exercise demonstrates decision tree classification, first using C50 and then using rpart. Because the two methods use different purity metrics and computational steps, it will compare and contrast the output of the two methods.
The C50 exercise begins by loading the package and reading in the file Iris_Data.csv.
# get the C5.0 package
install.packages('C50')
library('C50') # load the package
ir <- read.csv('Iris_Data.csv') # open iris dataset
The exercise looks at some descriptive statistics and charts for the iris data. This can help the reader understand what potential patterns are present in the dataset. While these tests are not explicitly included in the script for the other datasets, an interested reader can adapt the commands and repeat the descriptive analysis for themselves.
# summary, boxplot, pairs plot
summary(ir)
boxplot(ir[-5], main = 'Boxplot of Iris data by attributes')
pairs(ir[,-5], main = "Edgar Anderson's iris Data", pch = 21,
      bg = c("black", "red", "blue")[unclass(ir$Classification)])
The summary statistics provide context for the planned classification task. There are three labels in the Classification attribute. These are the target classes for our decision tree. The pairs plot [shown below] is helpful. The colors represent the three iris classes. The black dots [Setosa instances] appear to be well separated from the other dots. The other two groups [red and blue dots] are not as well separated.
The function C5.0( ) trains the decision tree model.
irTree <- C5.0(ir[,-5], ir[,5])
This is the minimal set of input arguments for this function. The first argument [the four flower measurement attributes, minus the Classification attribute] identifies the data that will be used to compute the information entropy and determine the classification splits. The second argument [the Classification attribute] identifies the class labels. The next two commands view the model output.
summary(irTree) # view the model components
plot(irTree, main = 'Iris decision tree') # view the model graphically
The summary( ) function is a C50 package version of the standard R library function. This version displays the C5.0 model output in three sections. The first section is the summary header: it states the function call, the class specification, and how many data instances were in the training dataset. The second section displays a text version of the decision tree.
PL <= 1.9: Setosa (50)
PL > 1.9:
:...PW > 1.7: Virginica (46/1)
    PW <= 1.7:
    :...PL <= 4.9: Versicolor (48/1)
        PL > 4.9: Virginica (6/2)
The first two output lines show a split based on the PL attribute at the value 1.9. Less than or equal to that value branches to a leaf node containing 50 instances of Setosa items. If PL is greater than 1.9, the tree branches to a node that splits based on the PW attribute at the value 1.7. Greater than 1.7 branches to a leaf node containing 46 instances of Virginica items and one item that is not Virginica. If PW is less than or equal to 1.7, the tree branches to a node that splits based on the PL attribute at the value 4.9. Less than or equal to that value branches to a leaf node containing 48 instances of Versicolor items and one item that is not Versicolor. If PL is greater than 4.9, the tree branches to a leaf node containing 6 instances of Virginica items and 2 items that are not Virginica. The third section of the summary( ) output shows the analysis of the classification quality based on classifying the training data with this decision tree model.
Evaluation on training data (150 cases):

	    Decision Tree
	  ----------------
	  Size      Errors
	     4    4( 2.7%)   <<

	   (a)   (b)   (c)    <-classified as
	  ----  ----  ----
	    50                (a): class Setosa
	          47     3    (b): class Versicolor
	           1    49    (c): class Virginica

	Attribute usage:
	100.00%	PL
	 66.67%	PW
This output shows that the decision tree has 4 leaf [classification] nodes and, using the training data, resulted in 4 of the 150 items being mis-classified [assigned a class that does not match the actual class], the 2.7% error rate. Below those results is a matrix showing the test classification results. Each row represents the number of data instances having a known class label [1st row (a) = Setosa, 2nd row (b) = Versicolor, 3rd row (c) = Virginica]. Each column indicates the number of data instances classified under a given label [(a) = Setosa, (b) = Versicolor, (c) = Virginica]. All 50 Setosa data instances were classified correctly. 47 Versicolor data instances were classified as Versicolor and 3 were classified as Virginica. 1 Virginica data instance was classified as Versicolor and 49 were classified correctly. Only two of the four flower measurement attributes [PL and PW] were used in the decision tree. Here is a graphical plot of the decision tree produced by the plot( ) function included in the C50 package [this function overloads the base plot( ) function for C50 tree objects].
plot(irTree, main = 'Iris decision tree') # view the model graphically
This decision tree chart depicts the same information as the text-based tree shown above, but it is visually more appealing. Each of the split nodes and its splitting criterion is easily understood. The classification results are represented as proportions, with the total number of data instances in each leaf node listed at the top of the node. This chart shows why decision trees are easy to understand. The output from C50 can also be represented as a rule set instead of a decision tree. This is helpful if the model results are intended to be converted into programming code in another computer language; explicit rules are easier to convert than the tree split criteria.
# build a rule set
irRules <- C5.0(ir[,-5], ir[,5], rules = TRUE)
summary(irRules) # view the ruleset
The C5.0( ) function call is modified with an additional argument [rules = TRUE]. This changes the model output from a decision tree to a decision rule set. The summary output of the rule set is similar to the summary output from the decision tree, except that the text-based decision tree is replaced by a rule set.
Rules:

Rule 1: (50, lift 2.9)
	PL <= 1.9
	->  class Setosa  [0.981]

Rule 2: (48/1, lift 2.9)
	PL > 1.9
	PL <= 4.9
	PW <= 1.7
	->  class Versicolor  [0.960]

Rule 3: (46/1, lift 2.9)
	PW > 1.7
	->  class Virginica  [0.958]

Rule 4: (46/2, lift 2.8)
	PL > 4.9
	->  class Virginica  [0.938]

Default class: Setosa
The rule set corresponds to the decision tree leaf nodes [one rule per leaf node], but a careful review reveals that some rules are different from the decision tree branches [Rule 3, for example]. The rules are applied in order. Any data items not classified by the first rule are tested by the second rule, and this process proceeds through each rule. Any data items that pass through all of the rules without being classified are trapped by the final default rule; the default is included to trap data that does not conform to the domain of the original training dataset. The strength of each rule is indicated by the number next to the class label. The lift metric is a measure of the performance of the rule at predicting the class [larger lift = better performance].
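To illustrate why a rule set is easy to translate into program code, here is a hand-written sketch [for illustration only, not generated by C50] of the printed rule set above as an R function that applies the rules in order and falls back to the default class:
# illustrative sketch: the iris rule set above, hand-converted into code
classify_iris_rules <- function(PL, PW) {
  if (PL <= 1.9) return('Setosa')                               # Rule 1
  if (PL > 1.9 && PL <= 4.9 && PW <= 1.7) return('Versicolor')  # Rule 2
  if (PW > 1.7) return('Virginica')                             # Rule 3
  if (PL > 4.9) return('Virginica')                             # Rule 4
  'Setosa'                                                      # default class
}
classify_iris_rules(PL = 5.1, PW = 1.8)   # matches Rule 3 -> 'Virginica'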
Before moving on to the next dataset, this exercise provides an example of how to use the predict( ) function in this package. Unknown data can be classified by passing the trained tree model to the function predict( ). The exercise uses all of the iris dataset in this example instead of actual unclassified data. While artificial, this allows the example to walk through a procedure that matches the new classifications to the actual classifications. After labeling the data with predicted classes, the prediction dataset is compared to the actual set and the mis-classified instances are found. This section is left as an exercise for the interested reader to explore.
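For readers who want a starting point, here is a minimal sketch of that procedure [one possible approach, using the irTree model trained above]:
# sketch of the prediction step described above
predClass <- predict(irTree, ir[,-5])           # predict a class for every row
table(predicted = predClass, actual = ir[,5])   # compare predicted vs. actual classes
which(predClass != ir[,5])                      # row numbers of the mis-classified instances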
The next section of the example computes a C5.0 decision tree and rule set model for the Wine dataset. This dataset is interesting because it consists of 13 numeric attributes representing the chemical and physical properties of the wine, plus a Class attribute that provides the target classes for the classification model.
wine <- read.csv('Wine.csv') # read the dataset
head(wine) # look at the 1st 6 rows
wTree <- C5.0(wine[,-14], as.factor(wine[,14])) # train the tree
The command head(wine) shows that the 14 attributes of the Wine dataset are all numeric. This means that the Class attribute must be converted to a factor so that C5.0 can use it as the target classes for the decision tree. C5.0 runs quickly and has no difficulty computing the information entropy and discovering the best split points. Here is the summary output and final decision tree:
summary(wTree) # view the model components

Decision tree:

Flavanoids <= 1.57:
:...Color.intensity <= 3.8: 2 (13)
:   Color.intensity > 3.8: 3 (49/1)
Flavanoids > 1.57:
:...Proline <= 720: 2 (54/1)
    Proline > 720:
    :...Color.intensity <= 3.4: 2 (4)
        Color.intensity > 3.4: 1 (58)

Evaluation on training data (178 cases):

	    Decision Tree
	  ----------------
	  Size      Errors
	     5    2( 1.1%)   <<

	   (a)   (b)   (c)    <-classified as
	  ----  ----  ----
	    58     1          (a): class 1
	          70     1    (b): class 2
	                48    (c): class 3

	Attribute usage:
	100.00%	Flavanoids
	 69.66%	Color.intensity
	 65.17%	Proline
plot(wTree, main = 'Wine decision tree') # view the model graphically
Notice that C5.0 uses only three of the 13 measurement attributes in the decision tree. These three attributes provide enough information to split the dataset into refined class subsets at each tree node. Here is the C5.0 rule set:
wRules <- C5.0(wine[,-14], as.factor(wine[,14]), rules = TRUE)
summary(wRules) # view the ruleset
Rules:

Rule 1: (58, lift 3.0)
	Flavanoids > 1.57
	Color.intensity > 3.4
	Proline > 720
	->  class 1  [0.983]

Rule 2: (55, lift 2.5)
	Color.intensity <= 3.4
	->  class 2  [0.982]

Rule 3: (54/1, lift 2.4)
	Flavanoids > 1.57
	Proline <= 720
	->  class 2  [0.964]

Rule 4: (13, lift 2.3)
	Flavanoids <= 1.57
	Color.intensity <= 3.8
	->  class 2  [0.933]

Rule 5: (49/1, lift 3.6)
	Flavanoids <= 1.57
	Color.intensity > 3.8
	->  class 3  [0.961]

Default class: 2

Attribute usage:
	 97.75%	Flavanoids
	 92.13%	Color.intensity
	 62.92%	Proline
These rules resemble the split conditions in the decision tree, and the same subset of three attributes is used in the rule set as in the decision tree, but the attribute usage percentages differ between the rule set and the decision tree. This difference results from the sequential application of the rules versus the split criteria applied in a decision tree.
The third example uses the Titanic
dataset. As stated above, this dataset consists of 2201 rows and four
categorical attributes [Class, Age, Sex, and Survived].
tn <- read.csv('Titanic.csv') # load the dataset into an object
head(tn) # view the first six rows of the dataset
Train a decision tree and view the results:
tnTree <- C5.0(tn[,-4], tn[,4])
plot(tnTree, main = 'Titanic decision tree') # view the tree
This decision tree uses only two tests for its classifications. The final leaf nodes look different from those in the two trees above. Since there are only two classes [Survived = Yes, No], the leaf nodes show the proportions of each class within each node. Here is the summary for the decision tree:
summary(tnTree) # view the tree object

Decision tree:

Sex = Male: No (1731/367)
Sex = Female:
:...Class in {Crew,First,Second}: Yes (274/20)
    Class = Third: No (196/90)
Evaluation on training data (2201 cases):

	    Decision Tree
	  ----------------
	  Size      Errors
	     3  477(21.7%)   <<

	    (a)    (b)    <-classified as
	  -----  -----
	   1470     20    (a): class No
	    457    254    (b): class Yes

	Attribute usage:
	100.00%	Sex
	 21.35%	Class
Train a rule set for the dataset and view the results:
tnRules <- C5.0(tn[,-4], tn[,4], rules = TRUE)
summary(tnRules) # view the ruleset
Rules:

Rule 1: (1731/367, lift 1.2)
	Sex = Male
	->  class No  [0.788]

Rule 2: (706/178, lift 1.1)
	Class = Third
	->  class No  [0.747]

Rule 3: (274/20, lift 2.9)
	Class in {Crew, First, Second}
	Sex = Female
	->  class Yes  [0.924]

Default class: No

Attribute usage:
	 91.09%	Sex
	 44.53%	Class
The classification exercise - rpart
The rpart( ) function trains a classification [regression] decision tree using the Gini index as its class purity metric. Since this metric is different from the information entropy computation used in C5.0, it may produce different splitting criteria for its decision trees. The rpart( ) function takes a pre-specified regression formula as its first argument. The format for this formula is: class variable ~ input variable A + input variable B + [any other input variables]. The examples in this discussion will use all of the dataset attributes as input variables and let rpart select the best ones for the decision tree model. Additionally, the summary of an rpart decision tree object is very different from the summary of a C5.0 decision tree object.
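As a side note [an equivalent formulation, not the one used in the code below; the object name irrTreeAlt is only for illustration], the same kind of formula can also be written with bare column names plus a data argument, which avoids repeating the data frame name:
library(rpart)  # load the rpart package
# equivalent sketch of the iris formula, assuming the column names SL, SW, PL, PW
irrTreeAlt <- rpart(Classification ~ SL + SW + PL + PW,
                    data = ir, method = 'class')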
Here is the code for the first example, which trains an rpart regression decision tree with the iris dataset [the rpart package must be loaded first]:
library(rpart) # load the rpart package
# create a label for our formula
f = ir$Classification ~ ir$SL + ir$SW + ir$PL + ir$PW
# train the tree
irrTree = rpart(f, method = 'class')
# view the tree summary
summary(irrTree)
After loading the rpart package, the first command defines the regression formula f. All four measurement attributes are used in the regression formula to predict the Classification attribute. The second command trains a regression classification tree using the formula f. The third command prints the summary of the regression tree object. Here is the output from summary(irrTree):
Call:
rpart(formula = f, method = "class")
  n= 150

    CP nsplit rel error xerror       xstd
1 0.50      0      1.00   1.21 0.04836666
2 0.44      1      0.50   0.66 0.06079474
3 0.01      2      0.06   0.12 0.03322650

Variable importance
ir$PW ir$PL ir$SL ir$SW 
   34    31    21    13 

Node number 1: 150 observations,    complexity param=0.5
  predicted class=Setosa  expected loss=0.6666667  P(node) =1
    class counts:    50    50    50
   probabilities: 0.333 0.333 0.333
  left son=2 (50 obs) right son=3 (100 obs)
  Primary splits:
      ir$PL < 2.45 to the left,  improve=50.00000, (0 missing)
      ir$PW < 0.8  to the left,  improve=50.00000, (0 missing)
      ir$SL < 5.45 to the left,  improve=34.16405, (0 missing)
      ir$SW < 3.35 to the right, improve=18.05556, (0 missing)
  Surrogate splits:
      ir$PW < 0.8  to the left,  agree=1.000, adj=1.00, (0 split)
      ir$SL < 5.45 to the left,  agree=0.920, adj=0.76, (0 split)
      ir$SW < 3.35 to the right, agree=0.827, adj=0.48, (0 split)

Node number 2: 50 observations
  predicted class=Setosa  expected loss=0  P(node) =0.3333333
    class counts:    50     0     0
   probabilities: 1.000 0.000 0.000

Node number 3: 100 observations,    complexity param=0.44
  predicted class=Versicolor  expected loss=0.5  P(node) =0.6666667
    class counts:     0    50    50
   probabilities: 0.000 0.500 0.500
  left son=6 (54 obs) right son=7 (46 obs)
  Primary splits:
      ir$PW < 1.75 to the left, improve=38.969400, (0 missing)
      ir$PL < 4.75 to the left, improve=37.353540, (0 missing)
      ir$SL < 6.15 to the left, improve=10.686870, (0 missing)
      ir$SW < 2.45 to the left, improve= 3.555556, (0 missing)
  Surrogate splits:
      ir$PL < 4.75 to the left, agree=0.91, adj=0.804, (0 split)
      ir$SL < 6.15 to the left, agree=0.73, adj=0.413, (0 split)
      ir$SW < 2.95 to the left, agree=0.67, adj=0.283, (0 split)

Node number 6: 54 observations
  predicted class=Versicolor  expected loss=0.09259259  P(node) =0.36
    class counts:     0    49     5
   probabilities: 0.000 0.907 0.093

Node number 7: 46 observations
  predicted class=Virginica  expected loss=0.02173913  P(node) =0.3066667
    class counts:     0     1    45
   probabilities: 0.000 0.022 0.978
This output provides the details of how rpart( ) selects the attribute and value at each split point [nodes 1 and 3]. Node 1 is the root node. There are 50 instances of each class at this node. Four primary split choices and three surrogate split choices are shown [best choice first]. The best split criterion [ir$PL < 2.45] splits the data left to node 2 [50 instances of Setosa] and right to node 3 [100 instances of both Versicolor and Virginica]. Note that the split criteria used by rpart are different from the split criteria produced by C5.0. This difference is due to the different splitting algorithm [information entropy versus Gini] used by each method. Node 2 contains 50 instances of Setosa. Since this node is one pure class, no additional split is needed. Node 3 splits the data based on the best primary split choice [ir$PW < 1.75] left to node 6 and right to node 7. This node numbering comes from the way rpart( ) labels the tree: the children of node k are numbered 2k and 2k+1, so numbers are skipped wherever a branch ends early, and if this regression tree went deeper, the node numbers would increase by jumps at each deeper level. Node 6 contains 54 data instances, with 49 Versicolor instances and 5 Virginica instances. Node 7 contains 46 instances, with 1 Versicolor instance and 45 Virginica instances.
A text version of the tree is displayed using the command:
print(irrTree) # view a text version of the tree
n= 150

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 150 100 Setosa (0.33333333 0.33333333 0.33333333)
  2) ir$PL< 2.45 50   0 Setosa (1.00000000 0.00000000 0.00000000) *
  3) ir$PL>=2.45 100 50 Versicolor (0.00000000 0.50000000 0.50000000)
    6) ir$PW< 1.75 54  5 Versicolor (0.00000000 0.90740741 0.09259259) *
    7) ir$PW>=1.75 46  1 Virginica (0.00000000 0.02173913 0.97826087) *
The plot( ) function included in the rpart package is functional, but it produces a tree that is not as visually appealing as the one included in C5.0. Three commands are needed to plot the regression tree:
par(xpd = TRUE) # define graphic parameter
plot(irrTree, main = 'Iris regression tree') # plot the tree
text(irrTree, use.n = TRUE) # add text labels to tree
The plot( ) function draws the tree and displays the chart title. The text( ) function adds the node labels, indicating the split criterion at the interior nodes and the classification results at the leaf nodes. Note that the leaf nodes also show the counts of each actual class.
For a more visually appealing regression tree, the rpart.plot package can be used. Here is the same iris regression tree using rpart.plot( ):
install.packages('rpart.plot') # get the rpart.plot package
library('rpart.plot') # load the package
rpart.plot(irrTree, main = 'Iris regression tree') # a better tree plot
One
function draws this chart. The interior nodes are colored in a light
shade of the target class. The node shows the proportion of each class
at that node and the percentage of the dataset at that node. The leaf
nodes are colored based on the node class. The node shows the proportion
of each class at that node and the percentage of the correct class from
the dataset at that node. This display of percentages differs from the
class counts displayed by the rpart plot(
) function.
Here is the second example of an rpart regression decision tree, using the wine dataset.
f <- wine$Class ~ wine$Alcohol + wine$Malic.acid + wine$Ash +
     wine$Alcalinity.of.ash + wine$Magnesium + wine$Total.phenols +
     wine$Flavanoids + wine$Nonflavanoid.phenols + wine$Proanthocyanins +
     wine$Color.intensity + wine$Hue + wine$OD280.OD315.of.diluted.wines +
     wine$Proline
winerTree = rpart(f, method = 'class') # train the tree
summary(winerTree) # view the tree summary
The first command defines the regression formula f. All thirteen measurement attributes are used in the regression formula to predict the Class attribute. The second command trains a regression classification tree using the formula f. The third command prints the summary of the regression tree object. Here is the output from summary(winerTree):
Call:
rpart(formula = f, method = "class")
  n= 178

          CP nsplit rel error    xerror       xstd
1 0.49532710      0 1.0000000 1.0000000 0.06105585
2 0.31775701      1 0.5046729 0.4859813 0.05670132
3 0.05607477      2 0.1869159 0.3364486 0.05008430
4 0.02803738      3 0.1308411 0.2056075 0.04103740
5 0.01000000      4 0.1028037 0.1588785 0.03664744

Variable importance
                  wine$Flavanoids wine$OD280.OD315.of.diluted.wines 
                               18                                17 
                     wine$Proline                      wine$Alcohol 
                               13                                12 
                         wine$Hue              wine$Color.intensity 
                               10                                 9 
               wine$Total.phenols              wine$Proanthocyanins 
                                8                                 7 
           wine$Alcalinity.of.ash                   wine$Malic.acid 
                                6                                 1 

Node number 1: 178 observations,    complexity param=0.4953271
  predicted class=2  expected loss=0.6011236  P(node) =1
    class counts:    59    71    48
   probabilities: 0.331 0.399 0.270
  left son=2 (67 obs) right son=3 (111 obs)
  Primary splits:
      wine$Proline                      < 755    to the right, improve=44.81780, (0 missing)
      wine$Color.intensity              < 3.82   to the left,  improve=43.48679, (0 missing)
      wine$Alcohol                      < 12.78  to the right, improve=40.45675, (0 missing)
      wine$OD280.OD315.of.diluted.wines < 2.115  to the right, improve=39.27074, (0 missing)
      wine$Flavanoids                   < 1.4    to the right, improve=39.21747, (0 missing)
  Surrogate splits:
      wine$Flavanoids                   < 2.31   to the right, agree=0.831, adj=0.552, (0 split)
      wine$Total.phenols                < 2.335  to the right, agree=0.781, adj=0.418, (0 split)
      wine$Alcohol                      < 12.975 to the right, agree=0.775, adj=0.403, (0 split)
      wine$Alcalinity.of.ash            < 17.45  to the left,  agree=0.770, adj=0.388, (0 split)
      wine$OD280.OD315.of.diluted.wines < 3.305  to the right, agree=0.725, adj=0.269, (0 split)

Node number 2: 67 observations,    complexity param=0.05607477
  predicted class=1  expected loss=0.1492537  P(node) =0.3764045
    class counts:    57     4     6
   probabilities: 0.851 0.060 0.090
  left son=4 (59 obs) right son=5 (8 obs)
  Primary splits:
      wine$Flavanoids                   < 2.165  to the right, improve=10.866940, (0 missing)
      wine$Total.phenols                < 2.05   to the right, improve=10.317060, (0 missing)
      wine$OD280.OD315.of.diluted.wines < 2.49   to the right, improve=10.317060, (0 missing)
      wine$Hue                          < 0.865  to the right, improve= 8.550391, (0 missing)
      wine$Alcohol                      < 13.02  to the right, improve= 5.273716, (0 missing)
  Surrogate splits:
      wine$Total.phenols                < 2.05   to the right, agree=0.985, adj=0.875, (0 split)
      wine$OD280.OD315.of.diluted.wines < 2.49   to the right, agree=0.985, adj=0.875, (0 split)
      wine$Hue                          < 0.78   to the right, agree=0.970, adj=0.750, (0 split)
      wine$Alcohol                      < 12.46  to the right, agree=0.940, adj=0.500, (0 split)
      wine$Proanthocyanins              < 1.195  to the right, agree=0.925, adj=0.375, (0 split)

Node number 3: 111 observations,    complexity param=0.317757
  predicted class=2  expected loss=0.3963964  P(node) =0.6235955
    class counts:     2    67    42
   probabilities: 0.018 0.604 0.378
  left son=6 (65 obs) right son=7 (46 obs)
  Primary splits:
      wine$OD280.OD315.of.diluted.wines < 2.115  to the right, improve=36.56508, (0 missing)
      wine$Color.intensity              < 4.85   to the left,  improve=36.17922, (0 missing)
      wine$Flavanoids                   < 1.235  to the right, improve=34.53661, (0 missing)
      wine$Hue                          < 0.785  to the right, improve=28.24602, (0 missing)
      wine$Alcohol                      < 12.745 to the left,  improve=23.14780, (0 missing)
  Surrogate splits:
      wine$Flavanoids      < 1.48   to the right, agree=0.910, adj=0.783, (0 split)
      wine$Color.intensity < 4.74   to the left,  agree=0.901, adj=0.761, (0 split)
      wine$Hue             < 0.785  to the right, agree=0.829, adj=0.587, (0 split)
      wine$Alcohol         < 12.525 to the left,  agree=0.802, adj=0.522, (0 split)
      wine$Proanthocyanins < 1.285  to the right, agree=0.775, adj=0.457, (0 split)

Node number 4: 59 observations
  predicted class=1  expected loss=0.03389831  P(node) =0.3314607
    class counts:    57     2     0
   probabilities: 0.966 0.034 0.000

Node number 5: 8 observations
  predicted class=3  expected loss=0.25  P(node) =0.04494382
    class counts:     0     2     6
   probabilities: 0.000 0.250 0.750

Node number 6: 65 observations
  predicted class=2  expected loss=0.06153846  P(node) =0.3651685
    class counts:     2    61     2
   probabilities: 0.031 0.938 0.031

Node number 7: 46 observations,    complexity param=0.02803738
  predicted class=3  expected loss=0.1304348  P(node) =0.258427
    class counts:     0     6    40
   probabilities: 0.000 0.130 0.870
  left son=14 (7 obs) right son=15 (39 obs)
  Primary splits:
      wine$Hue             < 0.9    to the right, improve=5.628922, (0 missing)
      wine$Malic.acid      < 1.6    to the left,  improve=4.737414, (0 missing)
      wine$Color.intensity < 4.85   to the left,  improve=4.044392, (0 missing)
      wine$Proanthocyanins < 0.705  to the left,  improve=3.211339, (0 missing)
      wine$Flavanoids      < 1.29   to the right, improve=2.645309, (0 missing)
  Surrogate splits:
      wine$Alcalinity.of.ash < 17.25  to the left, agree=0.935, adj=0.571, (0 split)
      wine$Color.intensity   < 3.56   to the left, agree=0.935, adj=0.571, (0 split)
      wine$Malic.acid        < 1.17   to the left, agree=0.913, adj=0.429, (0 split)
      wine$Proanthocyanins   < 0.485  to the left, agree=0.913, adj=0.429, (0 split)
      wine$Ash               < 2.06   to the left, agree=0.891, adj=0.286, (0 split)

Node number 14: 7 observations
  predicted class=2  expected loss=0.2857143  P(node) =0.03932584
    class counts:     0     5     2
   probabilities: 0.000 0.714 0.286

Node number 15: 39 observations
  predicted class=3  expected loss=0.02564103  P(node) =0.2191011
    class counts:     0     1    38
   probabilities: 0.000 0.026 0.974
This output provides the details of how rpart( ) selects the attribute and value at each split point [nodes 1, 2, 3, and 7]. Node 1 is the root node. Note that the split criteria used by rpart are different from the split criteria produced by C5.0. This difference is due to the different splitting algorithm [information entropy versus Gini] used by each method. As with the regression classification of the iris dataset, the node numbering is not completely sequential [children of node k are numbered 2k and 2k+1], and if this regression tree went deeper, the node numbers would increase by jumps at each deeper level.
A text version of the tree is displayed using the command:
print(winerTree) # view a text version of the tree
n= 178

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 178 107 2 (0.33146067 0.39887640 0.26966292)
  2) wine$Proline>=755 67 10 1 (0.85074627 0.05970149 0.08955224)
    4) wine$Flavanoids>=2.165 59 2 1 (0.96610169 0.03389831 0.00000000) *
    5) wine$Flavanoids< 2.165 8 2 3 (0.00000000 0.25000000 0.75000000) *
  3) wine$Proline< 755 111 44 2 (0.01801802 0.60360360 0.37837838)
    6) wine$OD280.OD315.of.diluted.wines>=2.115 65 4 2 (0.03076923 0.93846154 0.03076923) *
    7) wine$OD280.OD315.of.diluted.wines< 2.115 46 6 3 (0.00000000 0.13043478 0.86956522)
      14) wine$Hue>=0.9 7 2 2 (0.00000000 0.71428571 0.28571429) *
      15) wine$Hue< 0.9 39 1 3 (0.00000000 0.02564103 0.97435897) *
Several points are worth noticing in this regression tree. Only four of the thirteen attributes are used to split the data into classes. Additionally, while the goal is to classify the data into three classes, the regression tree uses five leaf nodes to accomplish this task. This result is an indicator that there are no definite class boundaries in this data, which differs from the definite boundary in the iris dataset between the Setosa class and the rest of the data.
Here is the output from the plot( ) function included in the rpart package:
par(xpd = TRUE) # define graphic parameter
plot(winerTree, main = 'Wine regression tree') # plot the tree
text(winerTree, use.n = TRUE) # add text labels to tree
Here is the same wine regression tree using rpart.plot( ):
rpart.plot(winerTree, main = 'Wine regression tree') # a better tree plot
The third example of rpart decision tree classification uses the Titanic dataset. Remember that this dataset consists of 2201 rows and four categorical attributes [Class, Age, Sex, and Survived].
f = tn$Survived ~ tn$Class + tn$Age + tn$Sex # declare the regression formula
tnrTree = rpart(f, method = 'class') # train the tree
summary(tnrTree) # view the tree summary
The first command defines the regression formula f. The Class, Age, and Sex attributes are used in the regression formula to predict the Survived attribute. The second command trains a regression classification tree using the formula f. The third command prints the summary of the regression tree object. Here is the output from summary(tnrTree):
Call:
rpart(formula = f, method = "class")
  n= 2201

          CP nsplit rel error    xerror       xstd
1 0.30661041      0 1.0000000 1.0000000 0.03085662
2 0.02250352      1 0.6933896 0.6933896 0.02750982
3 0.01125176      2 0.6708861 0.6863572 0.02741000
4 0.01000000      4 0.6483826 0.6765120 0.02726824

Variable importance
  tn$Sex tn$Class   tn$Age 
      73       23        4 

Node number 1: 2201 observations,    complexity param=0.3066104
  predicted class=No  expected loss=0.323035  P(node) =1
    class counts:  1490   711
   probabilities: 0.677 0.323
  left son=2 (1731 obs) right son=3 (470 obs)
  Primary splits:
      tn$Sex   splits as RL,   improve=199.821600, (0 missing)
      tn$Class splits as LRRL, improve= 69.684100, (0 missing)
      tn$Age   splits as LR,   improve=  9.165241, (0 missing)

Node number 2: 1731 observations,    complexity param=0.01125176
  predicted class=No  expected loss=0.2120162  P(node) =0.7864607
    class counts:  1364   367
   probabilities: 0.788 0.212
  left son=4 (1667 obs) right son=5 (64 obs)
  Primary splits:
      tn$Age   splits as LR,   improve=7.726764, (0 missing)
      tn$Class splits as LRLL, improve=7.046106, (0 missing)

Node number 3: 470 observations,    complexity param=0.02250352
  predicted class=Yes  expected loss=0.2680851  P(node) =0.2135393
    class counts:   126   344
   probabilities: 0.268 0.732
  left son=6 (196 obs) right son=7 (274 obs)
  Primary splits:
      tn$Class splits as RRRL, improve=50.015320, (0 missing)
      tn$Age   splits as RL,   improve= 1.197586, (0 missing)
  Surrogate splits:
      tn$Age splits as RL, agree=0.619, adj=0.087, (0 split)

Node number 4: 1667 observations
  predicted class=No  expected loss=0.2027594  P(node) =0.757383
    class counts:  1329   338
   probabilities: 0.797 0.203

Node number 5: 64 observations,    complexity param=0.01125176
  predicted class=No  expected loss=0.453125  P(node) =0.02907769
    class counts:    35    29
   probabilities: 0.547 0.453
  left son=10 (48 obs) right son=11 (16 obs)
  Primary splits:
      tn$Class splits as -RRL, improve=12.76042, (0 missing)

Node number 6: 196 observations
  predicted class=No  expected loss=0.4591837  P(node) =0.08905043
    class counts:   106    90
   probabilities: 0.541 0.459

Node number 7: 274 observations
  predicted class=Yes  expected loss=0.0729927  P(node) =0.1244889
    class counts:    20   254
   probabilities: 0.073 0.927

Node number 10: 48 observations
  predicted class=No  expected loss=0.2708333  P(node) =0.02180827
    class counts:    35    13
   probabilities: 0.729 0.271

Node number 11: 16 observations
  predicted class=Yes  expected loss=0  P(node) =0.007269423
    class counts:     0    16
   probabilities: 0.000 1.000
Here is the output from the plot( ) function included in the rpart package:
par(xpd = TRUE) # define graphic parameter
plot(tnrTree, main = 'Titanic regression tree') # plot the tree
text(tnrTree, use.n = TRUE) # add text labels to tree
Here is the same Titanic regression tree using rpart.plot( ):
rpart.plot(tnrTree, main = 'Titanic regression tree') # a better tree plot
This concludes the classification
exercise using C5.0 and rpart. Hopefully, this can help you set
up a classification model with either of these methods.